We propose a novel deep neural network architecture to learn interpretable representations for medical image analysis. Our architecture generates global attention for the region of interest, and then learns bag-of-words style deep feature embeddings with local attention. The global and local feature maps are combined using a contemporary transformer architecture for highly accurate Gallbladder Cancer (GBC) detection from Ultrasound (USG) images. Our experiments indicate that the detection accuracy of our model surpasses even that of human radiologists, which supports its use as a second reader for GBC diagnosis. The bag-of-words embeddings allow our model to be probed to generate interpretable explanations for GBC detection that are consistent with those reported in the medical literature. We show that the proposed model not only helps in understanding the decisions of neural network models but also aids in the discovery of new visual features relevant to the diagnosis of GBC. Source code and model will be available at https://github.com/sbasu276/RadFormer
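Modern deep neural network models are known to erroneously classify out-of-distribution (OOD) test data into one of the in-distribution (ID) training classes with high confidence. This can have disastrous consequences for safety-critical applications. A popular mitigation strategy is to train a separate classifier that can detect such OOD samples at test time. In most practical settings, OOD examples are not known at training time, so a key question is: how can the ID data be augmented with synthetic OOD samples to train such an OOD detector? In this paper, we propose a novel compounded corruption technique for OOD data augmentation, termed CNC. One of the major advantages of CNC is that it does not require any held-out data apart from the training set. Further, unlike current state-of-the-art (SOTA) techniques, CNC does not require backpropagation or ensembling at test time, which makes our method much faster at inference. Our extensive comparison with 20 methods from major conferences over the last 4 years shows that a model trained with CNC-based data augmentation outperforms SOTA in terms of both OOD detection accuracy and inference time. We include a detailed post-hoc analysis to investigate the reasons for the success of our method, and identify the higher relative entropy and diversity of CNC samples as probable causes. We also provide theoretical insights via a piecewise decomposition analysis on a two-dimensional dataset, which reveals (visually and quantitatively) that our approach leads to a tighter boundary around the ID classes and hence to better detection of OOD samples. Source code link: https://github.com/cnc-ood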
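The rich temporal information and variations in viewpoint make video data an attractive choice for learning image representations with unsupervised contrastive learning (UCL) techniques. State-of-the-art (SOTA) contrastive learning techniques treat frames within a video as positives in the embedding space, whereas frames from other videos are treated as negatives. We observe that, unlike the multiple views of an object in natural scene videos, an ultrasound (US) video captures different 2D slices of an organ. Hence, there is almost no similarity between temporally distant frames, even within the same US video. In this paper, we propose to use such frames as hard negatives instead. We advocate mining negatives through a hardness-sensitive negative mining curriculum within a UCL framework to learn rich image representations. We deploy our framework to learn representations of gallbladder (GB) malignancy from US videos. We also construct the first large-scale US video dataset, containing 64 videos and 15,800 frames, for learning GB representations. We show that a standard ResNet50 backbone trained with our framework improves accuracy on the GB malignancy detection task by 2-6% over models pretrained with SOTA UCL techniques as well as over models pretrained with supervision on ImageNet. We further validate the generalizability of our method on a publicly available lung image dataset of COVID-19 pathologies, where it improves on SOTA by 1.5%. Source code, dataset, and models are available at https://gbc-iitd.github.io/usucl.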
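The idea of treating temporally distant frames of the same video as hard negatives can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `near` and `far` frame-gap thresholds, the function names, and the use of a standard InfoNCE loss are all assumptions made for the example, and the hardness-sensitive curriculum is not modeled.

```python
import math

def split_frames(anchor, frames, near=2, far=10):
    """Split frames into positives and negatives for one anchor frame.

    Each frame is a (video_id, frame_index) pair. `near` and `far` are
    hypothetical thresholds on the frame gap, chosen for illustration.
    """
    anchor_video, anchor_index = anchor
    positives, negatives = [], []
    for frame in frames:
        if frame == anchor:
            continue
        video, index = frame
        gap = abs(index - anchor_index)
        if video != anchor_video:
            negatives.append(frame)   # negative from another video
        elif gap <= near:
            positives.append(frame)   # temporally close frame, same video
        elif gap >= far:
            negatives.append(frame)   # distant frame, same video: hard negative
        # Frames with an intermediate gap are left out of both sets.
    return positives, negatives

def info_nce(sim_pos, sims_neg, temperature=0.1):
    """Standard InfoNCE loss for one positive and a list of negatives."""
    logits = [sim_pos / temperature] + [s / temperature for s in sims_neg]
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum - logits[0]

frames = [("a", 0), ("a", 1), ("a", 20), ("b", 0)]
positives, negatives = split_frames(("a", 0), frames)
```

A negative that is more similar to the anchor yields a larger loss, which is why hard negatives carry a stronger training signal than easy ones.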
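Procedure learning involves identifying the key-steps of a task and determining their logical order. Existing approaches commonly learn the procedure from third-person videos, in which the manipulated object appears small and is often occluded by the actor, leading to significant errors. In contrast, we observe that videos obtained from first-person (egocentric) wearable cameras provide an unobstructed and clear view of the action. However, procedure learning from egocentric videos is challenging because (a) the camera view undergoes extreme changes due to the wearer's head motion, and (b) the unconstrained nature of the videos introduces unrelated frames. As a result, the assumption made by current state-of-the-art methods, that the actions occur at approximately the same time and have the same duration, does not hold. Instead, we propose to use the signal provided by the temporal correspondences between key-steps across videos. To this end, we present a novel self-supervised Correspond and Cut (CNC) framework for procedure learning. CNC identifies and exploits the temporal correspondences between the key-steps of multiple videos to learn the procedure. Our experiments show that CNC outperforms the state of the art on the benchmark ProceL and CrossTask datasets by 5.2% and 6.3%, respectively. Furthermore, for procedure learning from egocentric videos, we propose the EgoProceL dataset, which consists of 62 hours of video captured by 130 subjects performing 16 tasks. The source code and the dataset are available on the project page https://sid2697.github.io/egoprocel/.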
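Attention-based models such as transformers have shown outstanding performance on dense prediction tasks such as semantic segmentation, because they can capture long-range dependencies in an image. However, the benefit of transformers for monocular depth prediction has seldom been explored so far. This paper benchmarks various transformer-based models on the depth estimation task using the indoor NYUV2 dataset and the outdoor KITTI dataset. We propose a novel attention-based architecture, Depthformer, for monocular depth estimation. It uses multi-head self-attention to produce multiscale feature maps, which are effectively combined by our proposed decoder network. We also propose a TransBins module that divides the depth range into bins whose center values are estimated adaptively per image. The final estimated depth is a linear combination of the bin centers for each pixel. The TransBins module exploits the global receptive field by using the transformer module in the encoding stage. Experimental results on the NYUV2 and KITTI depth estimation benchmarks show that our proposed method improves on the state of the art by 3.3% on each benchmark in terms of root mean squared error (RMSE).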
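The bin-based depth computation can be illustrated with a small sketch. It assumes, beyond what the abstract states, that bin widths are obtained by a softmax over per-image scores and that each pixel's combination weights are a softmax over per-pixel scores; the function names `bin_centers` and `pixel_depth` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def bin_centers(width_scores, d_min, d_max):
    """Turn per-image width scores into bin centers inside [d_min, d_max].

    Assumption: widths are normalized with a softmax so that the bins
    exactly tile the depth range.
    """
    widths = softmax(width_scores)
    centers, edge = [], d_min
    for w in widths:
        span = w * (d_max - d_min)
        centers.append(edge + span / 2.0)  # midpoint of the current bin
        edge += span
    return centers

def pixel_depth(bin_scores, centers):
    """Depth of one pixel as a linear combination of the bin centers."""
    weights = softmax(bin_scores)
    return sum(w * c for w, c in zip(weights, centers))

# Four equally wide bins over a 0-10 m range, and a pixel with no preference.
centers = bin_centers([0.0, 0.0, 0.0, 0.0], d_min=0.0, d_max=10.0)
depth = pixel_depth([0.0, 0.0, 0.0, 0.0], centers)
```

With equal width scores the centers fall at the midpoints of four 2.5 m bins, and a pixel with uniform weights receives the mean of those centers. A pixel that strongly favors one bin receives a depth close to that bin's center.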
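We present TAPHSIR, a tool for anaphoric ambiguity detection and anaphora resolution in requirements. TAPHSIR facilitates reviewing the use of pronouns in a requirements specification and revising those pronouns that could lead to misunderstandings during the development process. To this end, TAPHSIR detects requirements that have potential anaphoric ambiguity and attempts to interpret anaphora occurrences automatically. TAPHSIR employs a hybrid solution composed of a machine learning-based ambiguity detection solution and an anaphora resolution solution based on a variant of the BERT language model. Given a requirements specification, TAPHSIR decides for each pronoun occurrence in the specification whether the pronoun is ambiguous or unambiguous, and further provides an automatic interpretation for the pronoun. The output generated by TAPHSIR can be easily reviewed and validated by requirements engineers. TAPHSIR is publicly available on Zenodo (doi:10.5281/Zenodo.5902117).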
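Few-shot object detection (FSOD) localizes and classifies objects in an image given only a few data samples. Recent trends in FSOD research show the adoption of metric learning and meta-learning techniques, which are prone to catastrophic forgetting and class confusion. To overcome these pitfalls of metric learning-based FSOD techniques, we introduce the Attention Guided Cosine Margin (AGCM), which facilitates the creation of tighter and well-separated class-specific feature clusters in the classification head of the object detector. Our novel Attentive Proposal Fusion (APF) module minimizes catastrophic forgetting by reducing the intra-class variance among co-occurring classes. At the same time, the proposed Cosine Margin Cross-Entropy loss increases the angular margin between confusing classes to overcome the challenge of class confusion between already learned (base) and newly added (novel) classes. We conduct our experiments on the challenging India Driving Dataset (IDD), which presents a real-world class-imbalanced setting, alongside the popular FSOD benchmark PASCAL-VOC. Our method outperforms state-of-the-art (SOTA) approaches by up to 6.4 mAP points on IDD-OS and by 2.0 mAP points on IDD-10 in the 10-shot setting. On the PASCAL-VOC dataset, we outperform existing SOTA approaches by up to 4.9 mAP points.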
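A cosine-margin cross-entropy loss can be sketched in a few lines. The version below follows the common additive cosine margin formulation (subtract a margin from the target-class cosine, then scale); whether the paper's loss takes exactly this form is an assumption, and the `margin` and `scale` values are illustrative defaults rather than values from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_margin_cross_entropy(feature, class_weights, target,
                                margin=0.35, scale=30.0):
    """Cross-entropy over scaled cosine logits with a margin on the target.

    Subtracting `margin` from the target cosine forces the feature to be
    closer in angle to its own class vector than to any other class.
    """
    logits = []
    for k, weight in enumerate(class_weights):
        cos = cosine(feature, weight)
        if k == target:
            cos -= margin            # penalize the true class
        logits.append(scale * cos)
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum - logits[target]

# A feature lying exactly between two class vectors is maximally confusing.
weights = [[1.0, 0.0], [0.0, 1.0]]
plain = cosine_margin_cross_entropy([1.0, 1.0], weights, target=0, margin=0.0)
with_margin = cosine_margin_cross_entropy([1.0, 1.0], weights, target=0)
```

Without a margin, the ambiguous feature incurs a loss of ln 2; with the margin, the loss is larger, so training pushes the feature further toward its own class vector.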
The rise in data has led to the need for dimension reduction techniques, especially for non-scalar variables in areas including time series, natural language processing, and computer vision. In this paper, we specifically investigate dimension reduction for time series through functional data analysis. Current methods for dimension reduction in functional data are functional principal component analysis and functional autoencoders, which are limited to linear mappings or scalar representations of the time series and are therefore inefficient. In real data applications, the nature of the data is much more complex. We propose a non-linear function-on-function approach consisting of a functional encoder and a functional decoder. It uses continuous hidden layers made up of continuous neurons to learn the structure inherent in functional data, which addresses the aforementioned limitations of existing approaches. Our approach gives a low-dimensional latent representation by reducing the number of functional features as well as the number of timepoints at which the functions are observed. The effectiveness of the proposed model is demonstrated through multiple simulations and real data examples.
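One way to picture a continuous hidden layer is as a map from an input function x(t) to a hidden function h(s) = tanh(∫ w(s, t) x(t) dt + b(s)), with the integral approximated on the observation grid. The sketch below is only an illustration under that assumption: the abstract does not specify the form of the continuous neurons, and the names `continuous_layer`, `weight`, and `bias` are hypothetical.

```python
import math

def trapezoid(values, grid):
    """Trapezoidal approximation of an integral from samples on a grid."""
    total = 0.0
    for i in range(1, len(grid)):
        total += 0.5 * (values[i] + values[i - 1]) * (grid[i] - grid[i - 1])
    return total

def continuous_layer(x_values, t_grid, s_grid, weight, bias):
    """Map a function sampled on t_grid to a hidden function on s_grid.

    `weight(s, t)` and `bias(s)` are hypothetical callables standing in
    for learned weight and bias functions.
    """
    hidden = []
    for s in s_grid:
        integrand = [weight(s, t) * x for t, x in zip(t_grid, x_values)]
        hidden.append(math.tanh(trapezoid(integrand, t_grid) + bias(s)))
    return hidden

# Input x(t) = t observed at 11 timepoints, encoded onto 3 hidden timepoints.
t_grid = [i / 10 for i in range(11)]
x_values = list(t_grid)
s_grid = [0.0, 0.5, 1.0]
hidden = continuous_layer(x_values, t_grid, s_grid,
                          weight=lambda s, t: 1.0, bias=lambda s: 0.0)
```

Choosing a coarser `s_grid` than `t_grid` is what reduces the number of timepoints in this sketch, while the output remains a sampled function instead of a single scalar.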
Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community, but have not received as much attention as lower-level tasks like speech and speaker recognition. In particular, there are not nearly as many SLU task benchmarks, and many of the existing ones use data that is not freely available to all researchers. Recent work has begun to introduce such benchmark datasets for several tasks. In this work, we introduce several new annotated SLU benchmark tasks based on freely available speech data, which complement existing benchmarks and address gaps in the SLU evaluation landscape. We contribute four tasks: question answering and summarization involve inference over longer speech sequences; named entity localization addresses the speech-specific task of locating the targeted content in the signal; dialog act classification identifies the function of a given speech utterance. We follow the blueprint of the Spoken Language Understanding Evaluation (SLUE) benchmark suite. In order to facilitate the development of SLU models that leverage the success of pre-trained speech representations, we will be publishing for each task (i) annotations for a relatively small fine-tuning set, (ii) annotated development and test sets, and (iii) baseline models for easy reproducibility and comparisons. In this work, we present the details of data collection and annotation and the performance of the baseline models. We also perform sensitivity analysis of pipeline models' performance (speech recognizer + text model) to the speech recognition accuracy, using more than 20 state-of-the-art speech recognition models.
Drawing from the resources of psychoanalysis and critical media studies, in this paper we develop an analysis of Large Language Models (LLMs) as automated subjects. We argue that the intentional fictional projection of subjectivity onto LLMs can yield an alternate frame through which AI behaviour, including its productions of bias and harm, can be analysed. First, we introduce language models, discuss their significance and risks, and outline our case for interpreting model design and outputs with support from psychoanalytic concepts. We trace a brief history of language models, culminating with the releases, in 2022, of systems that realise state-of-the-art natural language processing performance. We engage with one such system, OpenAI's InstructGPT, as a case study, detailing the layers of its construction and conducting exploratory and semi-structured interviews with chatbots. These interviews probe the model's moral imperatives to be helpful, truthful and harmless by design. The model acts, we argue, as the condensation of often competing social desires, articulated through the internet and harvested into training data, which must then be regulated and repressed. This foundational structure can, however, be redirected via prompting, so that the model comes to identify with, and transfer its commitments to, the immediate human subject before it. In turn, these automated productions of language can lead the human subject to project agency upon the model, occasionally effecting further forms of countertransference. We conclude that critical media methods and psychoanalytic theory together offer a productive frame for grasping the powerful new capacities of AI-driven language systems.
